Columbia University, Tel-Aviv University and Princeton University

We study a simple learning algorithm for binary classification. In- 
stead of predicting with the best hypothesis in the hypothesis class, 
that is, the hypothesis that minimizes the training error, our algo- 
rithm predicts with a weighted average of all hypotheses, weighted 
exponentially with respect to their training error. We show that the 
prediction of this algorithm is much more stable than the prediction 
of an algorithm that predicts with the best hypothesis. By allow- 
ing the algorithm to abstain from predicting on some examples, we 
show that the predictions it makes when it does not abstain are very 
reliable. Finally, we show that the probability that the algorithm ab- 
stains is comparable to the generalization error of the best hypothesis 
in the class. 

1. Introduction. Consider a binary classification learning problem. Suppose
we use a hypothesis class $\mathcal{H}$ and are presented with a training set
$(x_1,y_1),\ldots,(x_m,y_m)$ drawn independently from a distribution $D$ over
the example domain $X \times \{-1,+1\}$. Most learning algorithms for this
problem that have been studied in computational learning theory are based on
identifying the hypothesis $h \in \mathcal{H}$ that minimizes the training
error. One of the main problems with this approach is the phenomenon called
overfitting. Overfitting is encountered when the hypothesis class
$\mathcal{H}$ is too "large," "complex" or "flexible" relative to the size of
the training set. In this case it is likely that the algorithm will find a
hypothesis whose training error is very small but whose generalization error,
or test error, is large. To overcome this problem, one usually uses either
model-selection or regularization terms. Model selection methods try to
identify the "right" complexity for $\mathcal{H}$. A regularization term is a
measure of the complexity of the hypothesis $h$ that is added to the training
error to define a cost for each hypothesis. By minimizing this cost, the
learning algorithm attempts to minimize both the training error and the
amount of overfitting.

However, it is not clear that predicting with the hypothesis that mini- 
mizes the training error is indeed the only or the best prediction. One pop- 
ular alternative to predicting using the single best hypothesis is to average 
the prediction of those hypotheses whose performance on the training set 
is close to optimal. Two popular methods of this type are Bayesian averag- 
ing [15] and bagging [4, 5]. There is considerable experimental evidence that 
such averaging can significantly reduce the amount of overfitting suffered by 
the learning algorithm. However, there is, we believe, a lack of theory for 
explaining this reduction. 

In the context of bagging, the common explanation is based on the argu- 
ment that averaging reduces the variance of the classification rule. However, 
as argued elsewhere [11, 18], there is currently no adequate definition of 
variance for classification problems. In addition, this explanation fails to 
take into account the effect that the complexity of the model class has on 
overfitting. 

In the Bayesian approach the problem of overfitting is generally ignored. 
Instead the basic argument is that the Bayesian method is always the best 
method, and therefore, the only important issues are how to choose a good 
prior distribution and how to efficiently calculate the posterior average. How- 
ever, the optimality of the Bayesian method is based on the assumption that 
the data we observe are generated according to one of the distribution mod- 
els in the chosen class of models. While this assumption is attractive for 
theory, it almost never holds in practice. In practice, one usually uses rel- 
atively simple models, either because there is not enough data to estimate 
the "true" model, because the computational complexity is prohibitive, or 
because our prior knowledge of the system is only partial. Even when very 
complex models are used, it is rarely the case that one can assume that the 
data are generated by a model in the class. As a result, Bayesian theory is 
inadequate for explaining why Bayesian prediction methods are better than 
predicting with the best model in the class. 

In this paper we propose a prediction method that is based on averaging 
among the empirically best classification rules. This method is similar to, 
but different from, the Bayesian method. The advantage of this method is 
that we can theoretically justify its usage without making the aforemen- 
tioned Bayesian assumption that the data is generated by a distribution 
from a given class of distributions. Instead we make the following weaker as- 
sumptions which are common in the context of empirical error minimization 
methods. First, we assume that the data is generated i.i.d. according to the 
distribution D defined above but make absolutely no assumption about D 
other than that it is a fixed distribution. Second, we choose a class of
prediction rules (mappings from the input to the binary output) and assume
that there are prediction rules in that class whose probability of error (with
respect to the distribution D) is small, but not necessarily equal to zero.

We deviate from the analysis used for empirical error minimization meth- 
ods in our definition of a classification rule. In the context of a binary pre- 
diction problem, we allow the classifier three possible outputs. Two of them, 
$-1$ and $+1$, are interpreted, as before, as predictions of the label. The third,
denoted by 0, should be interpreted as "no prediction" or "insufficient data." 

What is the benefit of allowing the predictor this new output? The ad- 
vantage is that it allows the user of the classifier to identify those examples 
on which overfitting might occur. For example, suppose that the best
hypothesis $h^*$ in our hypothesis class $\mathcal{H}$ has an expected error
of 1%. Suppose further that the size of the training set and the complexity
of $\mathcal{H}$ are such that the hypothesis $\hat{h}$ that minimizes the
empirical error is likely to have a generalization error of 5%. If we use
$\hat{h}$ to make our predictions, then the most we can hope to get from a
uniform-convergence type analysis is an upper bound on the generalization
error that is close to 5%; we have no way of identifying where these errors
might occur. On the other hand, if we
allow the algorithm to output a zero, we can hope that the algorithm will 
output zero on about 4% of the input, and will be incorrect on about 1% of 
the data. In such a case, we say that the classifier identifies the locations of 
potential overfitting and allows the user to choose a special course of action 
for this case (such as referring the example back to a human to make the 
classification). In this case we can justifiably say that the algorithm man- 
aged to avoid overfitting. It is not misleading us into thinking that we have 
a classifier that is very accurate just because its error on the training set is 
small. 

As a toy example, Figure 1 shows a tiny learning problem in which positive 
and negative training examples are indicated by pluses and minuses. In this 
example hypotheses are represented by rectangles, and we suppose that there 
is a large space of rectangular hypotheses, the best three of which are shown 
in Figure 1. Each of these makes two mistakes on this data set. However, if 
we take an average of hypotheses, one can imagine that it would be possible 
to obtain a combined classifier that abstains on all points in the shaded 
region where there is likely to be disagreement among the hypotheses, and 
predicts according to the weighted majority elsewhere. Such a combined 
classifier, when it does not abstain, would give nearly perfect predictions, 
having successfully identified the regions where errors are most likely to 
occur. 

Of course, if the generated classifier outputs zero most of the time, then 
there is no benefit from having it. We need to show two things to be
convinced that the addition of the new output is useful. First, we need to
show that the probability of outputting a zero is of the same order as the
bounds on overfitting that we would get from an analysis based on uniform
convergence. Second, we need to show that when the output is $+1$ or $-1$,
the probability of making a mistake is similar to the generalization error of
the best hypothesis in the class. In this paper we prove that our algorithm
has both these properties in the case that $\mathcal{H}$ is a finite class of
models. In future work we hope to show how this work can be extended to
infinite model classes.

If $\mathcal{H}$ is finite, the uniform convergence bound is the well-known
Occam's razor bound [2]. If $\mathcal{H}$ is infinite, we have to resort to
bounds based on VC-dimension [21]. Unfortunately, these bounds are usually
very loose and provide very poor estimates for the generalization error of
learning algorithms in real-world applications.

In recent years, researchers in computational learning theory have started 
to consider algorithms that search for a good classification rule by optimiz- 
ing quantities other than the training error. Algorithms of this type include 
support-vector machines [21] and boosting [18], which maximize the "margin"
of a linear classifier. Other work by Shawe-Taylor and Williamson [20] and
McAllester [16] provides PAC-style analysis of Bayesian algorithms. Bayesian
algorithms compute the posterior distribution over the space of hypotheses
and predict by averaging the predictions of all hypotheses whose training
error is close to the minimum. Another work that is relevant here is the
work by Bousquet and Elisseeff [3] on the relationship between stability and
generalization in learning classification rules.

Fig. 1. A toy example.

In this paper we study a learning algorithm that is very similar to the 
algorithm that would be suggested by Bayesian analysis but uses a slightly 
different formula for computing the posterior distribution. This formula is 
the "exponential weights" formula introduced by Littlestone and Warmuth 
in the context of the weighted-majority algorithm [14] and further analyzed 
by Cesa-Bianchi, Freund, Haussler, Helmbold, Schapire and Warmuth [6]. 
Note, however, that we are generating a fixed classification rule and are 
therefore working in the standard batch learning model and not in the online 
learning model. 

The analysis of the algorithm consists of two parts. First, we consider,
for each instance $x$, the log of the ratio of the total weight between those
hypotheses that predict $+1$ on $x$ and those hypotheses that predict $-1$,
where the weights depend on a parameter $\eta$. We denote this log ratio by
$\hat{\ell}(x)$. We prove that $\hat{\ell}(x)$ is rather insensitive to the
random choice of the training set. In particular, we prove that the variation
in $\hat{\ell}(x)$ is independent of the concept class $\mathcal{H}$. This
proof is interesting because it avoids using the standard "union bound;" in
fact, it altogether avoids making any uniform claim on all of the hypotheses
in $\mathcal{H}$.

Using this central theorem, we can show that if $\hat{\ell}(x)$ is far from
zero, then predicting with $\mathrm{sign}(\hat{\ell}(x))$ is very stable,
that is, is unlikely to change from training set to training set. More
precisely, we introduce a nonstochastic quantity $\ell(x)$ and show that
$\hat{\ell}(x)$ is, with high probability, very close to $\ell(x)$. Our
algorithm predicts with $\mathrm{sign}(\hat{\ell}(x))$ when $\hat{\ell}(x)$
is far from zero and abstains from prediction when $\hat{\ell}(x)$ is close
to zero. We prove that the probability that this algorithm makes a prediction
different from $\mathrm{sign}(\ell(x))$ when it does not abstain is very
small. On the other hand, we show that if $\mathcal{H}$ is finite and there
is a hypothesis $h \in \mathcal{H}$ whose error is $\epsilon$, then we can
set the parameter $\eta$ such that the error of $\mathrm{sign}(\ell(x))$ is
at most about $2\epsilon$.

The relation between our algorithm and algorithms that predict with 
the best hypothesis on the training set has a close correspondence to the 
relation between Bayesian prediction algorithms and MAP (maximum a- 
posteriori) algorithms. However, the analysis is carried out without making 
a Bayesian assumption, that is, we do not assume that the training data 
are generated by a model in a pre-specified class chosen by a pre-specified 
prior distribution. The prior and posterior distributions are internal to the 
algorithm and are not part of the world around it. 

We hope that this paper will shed some new light on the use of algorithms 
that average many hypotheses such as Bayesian algorithms and averaging 
methods such as bagging [4, 5].

The paper is organized as follows. We start in Section 2 by describing
the prediction algorithm. We give the basic analysis of the algorithm in
Section 3. In Section 4 we bound the performance of $\ell(x)$ in terms of the
error of the best hypothesis in the class. In Section 6 we give a bound that
is uniform with respect to the learning rate parameter $\eta$, which makes it
possible to choose this parameter after observing the training set. Finally,
in Section 7 we outline how the ideas and results in Sections 2-4 can be
extended to infinite hypothesis classes.

2. The algorithm. Let $D$ be a fixed but unknown distribution over $(x, y)$
pairs, where $x \in X$ and $y \in \{-1, +1\}$. Let $\mathcal{H}$ be a fixed
class of hypotheses, that is, mappings from $X$ to $\{-1, +1\}$. Let $S$
denote a sample of $m$ training examples, each drawn independently at random
according to $D$. We denote the true error of a hypothesis $h$ by
$\epsilon(h) \doteq \Pr_{(x,y)\sim D}[h(x) \neq y]$ and the estimated error
according to the sample $S$ by
$\hat{\epsilon}(h) \doteq \frac{1}{m}\,|\{(x,y) \in S : h(x) \neq y\}|$.

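The prediction algorithm that we study calculates for each hypothesis $h$
a weight that is defined as $w(h) \doteq e^{-\eta\hat{\epsilon}(h)}$, where
$\eta > 0$ is a parameter of the algorithm. The prediction on a new instance
$x$ is defined as a function of the empirical log ratio:

$$\hat{\ell}_\eta(x) \doteq \frac{1}{\eta}\ln\left(\frac{\sum_{h:\,h(x)=+1} e^{-\eta\hat{\epsilon}(h)}}{\sum_{h:\,h(x)=-1} e^{-\eta\hat{\epsilon}(h)}}\right).$$

The prediction is defined to be

$$\hat{p}_{\eta,\Delta}(x) \doteq \begin{cases} \mathrm{sign}(\hat{\ell}_\eta(x)), & \text{if } |\hat{\ell}_\eta(x)| > \Delta, \\ 0, & \text{otherwise,} \end{cases}$$

where $\Delta \ge 0$ is a second parameter of the algorithm. Intuitively, the
parameter $\Delta$ characterizes the range of values of $\hat{\ell}_\eta(x)$
in which the training data is insufficient to make a good prediction and a
better choice is to abstain. When clear from context, we generally drop the
subscripts and write simply $\hat{\ell}(x)$ and $\hat{p}(x)$.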
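The definitions above can be sketched in a few lines of Python. This is a
minimal illustration rather than the authors' implementation: it assumes a
finite hypothesis class given as a list of callables returning -1 or +1, and
the function names (`empirical_error`, `empirical_log_ratio`, `predict`) and
the convention of returning plus or minus infinity when all hypotheses agree
on an instance are choices made here, not part of the paper.

```python
import math


def empirical_error(h, sample):
    """Fraction of training examples (x, y) with h(x) != y."""
    return sum(1 for x, y in sample if h(x) != y) / len(sample)


def empirical_log_ratio(hypotheses, sample, x, eta):
    """(1/eta) * ln of the weight predicting +1 on x over the weight predicting -1.

    Each hypothesis h carries the weight exp(-eta * empirical_error(h)).
    """
    plus = minus = 0.0
    for h in hypotheses:
        w = math.exp(-eta * empirical_error(h, sample))
        if h(x) == 1:
            plus += w
        else:
            minus += w
    # Assumption made for this sketch: unanimous votes give an infinite ratio.
    if minus == 0.0:
        return math.inf
    if plus == 0.0:
        return -math.inf
    return math.log(plus / minus) / eta


def predict(hypotheses, sample, x, eta, threshold):
    """Return +1 or -1 when |log ratio| > threshold (Delta), else 0 (abstain)."""
    l_hat = empirical_log_ratio(hypotheses, sample, x, eta)
    if abs(l_hat) > threshold:
        return 1 if l_hat > 0 else -1
    return 0
```

For instance, with five threshold rules on the unit interval and four labeled
points, the predictor commits far from the region where the low-error rules
disagree and abstains inside it. The cost is one pass over the hypothesis
class per query, which is only practical for small finite classes.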

3. Analysis of the algorithm. For an instance $x$, we define the true log
ratio to be

$$\ell_\eta(x) \doteq \frac{1}{\eta}\ln\left(\frac{\sum_{h:\,h(x)=+1} e^{-\eta\epsilon(h)}}{\sum_{h:\,h(x)=-1} e^{-\eta\epsilon(h)}}\right),$$

which we often write as $\ell(x)$ when $\eta$ is clear from context. The basic
idea of our analysis is to show that $\hat{\ell}(x)$ must usually be close to
$\ell(x)$ with high probability. In particular, we will prove the following
two theorems. First, we will prove that for any fixed $x$ the difference
between the empirical log ratio and the true log ratio is small:



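Theorem 1. For any distribution $D$, any instance $x$, any $\lambda, \eta > 0$
and any $s \in \{-1, +1\}$:

$$\Pr_{S\sim D^m}\left[ s\bigl(\hat{\ell}(x) - \ell(x)\bigr) \ge 2\lambda + \frac{\eta}{8m} \right] \le 2e^{-2\lambda^2 m}.$$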

Then, in order to show that our algorithm has reasonable performance, 
we will transform Theorem 1 which gives a guarantee that holds with high 
probability for any fixed instance to a claim that holds with respect to a 
randomly chosen instance: 



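Theorem 2. For any $\delta > 0$ and $\eta > 0$, if we set

$$\Delta = 2\sqrt{\frac{\ln(\sqrt{2}/\delta)}{m}} + \frac{\eta}{8m},$$

then, with probability at least $1 - \delta$ over the random choice of the
training set,

$$\Pr_{(x,y)\sim D}\bigl[\hat{p}(x) \neq 0 \text{ and } \hat{p}(x) \neq \mathrm{sign}(\ell(x))\bigr] \le \delta.$$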

This theorem guarantees that, when our algorithm predicts something
different from 0 (which can be interpreted as "I do not know"), it is very
likely to be making the same prediction as $\mathrm{sign}(\ell(x))$. Note
that the statements of Theorems 1 and 2 have no dependence on the hypothesis
class $\mathcal{H}$. In fact, the theorems and their proofs can be extended
to infinite hypothesis classes, as discussed in Section 7.
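To get a feel for the size of the abstention threshold in Theorem 2, the
following sketch evaluates it numerically. The function names are
illustrative, and the helper `confidence_lambda` simply solves
$\sqrt{2}\,e^{-\lambda^2 m} = \delta$ for $\lambda$, which is the choice of
$\lambda$ that makes $2\lambda + \eta/(8m)$ equal the threshold stated in the
theorem.

```python
import math


def confidence_lambda(m, delta):
    """Solve sqrt(2) * exp(-lam**2 * m) = delta for lam (needs 0 < delta < sqrt(2))."""
    if not 0.0 < delta < math.sqrt(2.0):
        raise ValueError("delta must lie in (0, sqrt(2))")
    return math.sqrt(math.log(math.sqrt(2.0) / delta) / m)


def abstention_threshold(m, eta, delta):
    """Delta = 2 * sqrt(ln(sqrt(2)/delta) / m) + eta / (8 m), as in Theorem 2."""
    return 2.0 * confidence_lambda(m, delta) + eta / (8.0 * m)
```

With $m = 10{,}000$ examples, $\eta = 100$ and $\delta = 0.05$, the threshold
is roughly 0.038, dominated by the $\sqrt{\ln(\sqrt{2}/\delta)/m}$ term; the
$\eta/(8m)$ term only matters once $\eta$ grows on the order of $\sqrt{m}$.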

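We define some notation that will be used in the proofs. For
$\mathcal{K} \subseteq \mathcal{H}$, let

$$R_\eta(\mathcal{K}) \doteq \frac{1}{\eta}\ln\left(\sum_{h\in\mathcal{K}} e^{-\eta\epsilon(h)}\right)$$

and let $\hat{R}_\eta(\mathcal{K})$ be the random variable

$$\hat{R}_\eta(\mathcal{K}) \doteq \frac{1}{\eta}\ln\left(\sum_{h\in\mathcal{K}} e^{-\eta\hat{\epsilon}(h)}\right).$$

We show that $\hat{R}_\eta(\mathcal{K})$ is close to $R_\eta(\mathcal{K})$
(with high probability) in two steps: First, we show that
$\hat{R}_\eta(\mathcal{K})$ is close to its expectation
$\mathrm{E}[\hat{R}_\eta(\mathcal{K})]$ with high probability. Then we show
that $\mathrm{E}[\hat{R}_\eta(\mathcal{K})]$ is close to $R_\eta(\mathcal{K})$.
To prove the first result, we apply McDiarmid's theorem [17]:

Theorem 3 (McDiarmid). Let $X_1, \ldots, X_m$ be independent random
variables taking values in a set $V$. Let $f : V^m \to \mathbb{R}$ be such
that, for $i = 1, \ldots, m$:

$$|f(x_1, \ldots, x_m) - f(x_1, \ldots, x_{i-1}, x_i', x_{i+1}, \ldots, x_m)| \le c_i$$

for all $x_1, \ldots, x_m; x_i' \in V$. Then for $\epsilon > 0$,
$s \in \{-1, +1\}$,

$$\Pr\bigl[s\bigl(f(X_1, \ldots, X_m) - \mathrm{E}[f(X_1, \ldots, X_m)]\bigr) \ge \epsilon\bigr] \le \exp\left(-\frac{2\epsilon^2}{\sum_{i=1}^m c_i^2}\right).$$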



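Lemma 1. Let $\mathcal{K}$ and $\hat{R}_\eta(\mathcal{K})$ be as above for a
sample of size $m$. For $\eta > 0$, $\lambda > 0$ and $s \in \{-1, +1\}$,

$$\Pr\bigl[s\bigl(\hat{R}_\eta(\mathcal{K}) - \mathrm{E}[\hat{R}_\eta(\mathcal{K})]\bigr) \ge \lambda\bigr] \le e^{-2\lambda^2 m}.$$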



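Proof. We apply McDiarmid's theorem with the $X_i$'s set to the labeled
examples of $S$, and the function $f$ set equal to the random variable
$\hat{R}_\eta(\mathcal{K})$. Let $S'$ be the sample $S$ in which one example
$(x_i, y_i)$ is replaced by $(x_i', y_i')$. Let $\hat{\epsilon}'(h)$ be the
empirical error of $h$ on $S'$, and let

$$\hat{R}'_\eta(\mathcal{K}) \doteq \frac{1}{\eta}\ln\left(\sum_{h\in\mathcal{K}} e^{-\eta\hat{\epsilon}'(h)}\right).$$

Then

$$\hat{R}_\eta(\mathcal{K}) - \hat{R}'_\eta(\mathcal{K}) = \frac{1}{\eta}\ln\left(\frac{\sum_{h\in\mathcal{K}} e^{-\eta\hat{\epsilon}(h)}}{\sum_{h\in\mathcal{K}} e^{-\eta\hat{\epsilon}'(h)}}\right) \le \frac{1}{\eta}\ln\left(\max_{h\in\mathcal{K}} e^{-\eta(\hat{\epsilon}(h) - \hat{\epsilon}'(h))}\right) = \max_{h\in\mathcal{K}}\bigl(\hat{\epsilon}'(h) - \hat{\epsilon}(h)\bigr) \le \frac{1}{m}.$$

The first inequality uses the fact that
$(\sum_i a_i)/(\sum_i b_i) \le \max_i a_i/b_i$ for positive $a_i$'s and
$b_i$'s. The second inequality uses the fact that changing one example can
change the empirical error by at most $1/m$.

By the symmetry of this argument,
$|\hat{R}_\eta(\mathcal{K}) - \hat{R}'_\eta(\mathcal{K})| \le 1/m$. Plugging
in $c_i = 1/m$ in McDiarmid's theorem gives the result. □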
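The final plug-in step is short enough to spell out. With $c_i = 1/m$ for
every $i$ and $\epsilon = \lambda$ in Theorem 3, the exponent evaluates as
follows (a worked instance of the substitution, using only quantities defined
above):

```latex
\sum_{i=1}^{m} c_i^2 = m \cdot \frac{1}{m^2} = \frac{1}{m},
\qquad
\exp\!\left(-\frac{2\lambda^2}{\sum_{i=1}^{m} c_i^2}\right)
  = \exp\!\left(-2\lambda^2 m\right).
```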



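Lemma 2. Let $\mathcal{K}$, $R_\eta(\mathcal{K})$ and
$\hat{R}_\eta(\mathcal{K})$ be as above for a sample of size $m$. Then for
$\eta > 0$,

$$R_\eta(\mathcal{K}) \le \mathrm{E}[\hat{R}_\eta(\mathcal{K})] \le R_\eta(\mathcal{K}) + \frac{\eta}{8m}.$$

Proof. For the lower bound on $\mathrm{E}[\hat{R}_\eta(\mathcal{K})]$, let
$\mathcal{K} = \{h_1, \ldots, h_N\}$. For $\mathbf{x} \in \mathbb{R}^N$, let

$$g(\mathbf{x}) \doteq \ln\left(\sum_{i=1}^N e^{x_i}\right).$$

Then $g$ is convex: Given $a \in (0, 1)$ and
$\mathbf{x}, \mathbf{y} \in \mathbb{R}^N$, let $p = 1/a$, $q = 1/(1 - a)$,
and define $r_i = e^{a x_i}$ and $s_i = e^{(1-a) y_i}$. Since
$1/p + 1/q = 1$, by Hölder's inequality,

$$\sum_i r_i s_i \le \left(\sum_i r_i^p\right)^{1/p}\left(\sum_i s_i^q\right)^{1/q}.$$

Plugging in definitions and taking logarithms, this is equivalent to

$$g(a\mathbf{x} + (1 - a)\mathbf{y}) \le a\,g(\mathbf{x}) + (1 - a)\,g(\mathbf{y}),$$

so $g$ is convex as claimed.

Therefore, by Jensen's inequality,

$$\eta\,\mathrm{E}[\hat{R}_\eta(\mathcal{K})] = \mathrm{E}\bigl[g\bigl((-\eta\hat{\epsilon}(h_1), \ldots, -\eta\hat{\epsilon}(h_N))\bigr)\bigr] \ge g\bigl((-\eta\,\mathrm{E}[\hat{\epsilon}(h_1)], \ldots, -\eta\,\mathrm{E}[\hat{\epsilon}(h_N)])\bigr) = g\bigl((-\eta\epsilon(h_1), \ldots, -\eta\epsilon(h_N))\bigr) = \eta R_\eta(\mathcal{K}).$$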

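To prove the upper bound on $\mathrm{E}[\hat{R}_\eta(\mathcal{K})]$, we have
by Jensen's inequality (applied to the concave log function),

$$\mathrm{E}[\hat{R}_\eta(\mathcal{K})] = \frac{1}{\eta}\,\mathrm{E}\left[\ln\left(\sum_{h\in\mathcal{K}} e^{-\eta\hat{\epsilon}(h)}\right)\right] \le \frac{1}{\eta}\ln\left(\sum_{h\in\mathcal{K}} \mathrm{E}\bigl[e^{-\eta\hat{\epsilon}(h)}\bigr]\right). \tag{1}$$

Fix $h$ and let $\epsilon = \epsilon(h)$ and $\hat{\epsilon} = \hat{\epsilon}(h)$.
Let $Z_i$ be a Bernoulli random variable that is 1 if $h(x_i) \neq y_i$ and 0
otherwise. Then we can write

$$\mathrm{E}\bigl[e^{\eta(\epsilon - \hat{\epsilon})}\bigr] = \mathrm{E}\left[\exp\left(\frac{\eta}{m}\sum_{i=1}^m (\epsilon - Z_i)\right)\right] = \prod_{i=1}^m \mathrm{E}\left[\exp\left(\frac{\eta}{m}(\epsilon - Z_i)\right)\right] \le e^{\eta^2/(8m)}.$$

The second equality uses independence of the $Z_i$'s. The last step uses the
fact, proved by Hoeffding [13], that for any random variable $X$ with
$\mathrm{E}[X] = 0$ and $a \le X \le b$,

$$\mathrm{E}[e^X] \le e^{(b-a)^2/8}.$$

Here we let $X = (\eta/m)(\epsilon - Z_i)$, which takes values in an interval
of length $\eta/m$, so each of the $m$ factors is at most $e^{\eta^2/(8m^2)}$.

Thus, $\mathrm{E}[e^{-\eta\hat{\epsilon}(h)}] \le e^{\eta^2/(8m)}\,e^{-\eta\epsilon(h)}$.
Combined with (1), this gives that

$$\mathrm{E}[\hat{R}_\eta(\mathcal{K})] \le \frac{1}{\eta}\ln\left(e^{\eta^2/(8m)}\sum_{h\in\mathcal{K}} e^{-\eta\epsilon(h)}\right) = R_\eta(\mathcal{K}) + \frac{\eta}{8m}$$

as claimed. □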
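Lemma 2 can be checked exactly on a toy problem by enumerating every possible
training sample. The sketch below is only an illustration under assumptions
made here: a finite list of labeled points with known probabilities stands in
for $D$, the hypotheses are arbitrary callables, and the names
`soft_min_score` and `check_lemma2` are hypothetical. It returns
$R_\eta(\mathcal{K})$, the exact expectation of $\hat{R}_\eta(\mathcal{K})$,
and $R_\eta(\mathcal{K}) + \eta/(8m)$, which the lemma says must appear in
nondecreasing order.

```python
import itertools
import math


def soft_min_score(errors, eta):
    """(1/eta) * ln(sum over hypotheses of exp(-eta * error))."""
    return math.log(sum(math.exp(-eta * e) for e in errors)) / eta


def check_lemma2(points, probs, hypotheses, m, eta):
    """Return (R, E[R_hat], R + eta/(8m)) for i.i.d. samples of size m.

    points: list of (x, y) pairs; probs: their probabilities under D.
    The expectation is exact: it sums over all len(points)**m samples.
    """
    true_errors = [
        sum(p for (x, y), p in zip(points, probs) if h(x) != y)
        for h in hypotheses
    ]
    expected_r_hat = 0.0
    for idx in itertools.product(range(len(points)), repeat=m):
        sample_prob = math.prod(probs[i] for i in idx)
        emp_errors = [
            sum(1 for i in idx if h(points[i][0]) != points[i][1]) / m
            for h in hypotheses
        ]
        expected_r_hat += sample_prob * soft_min_score(emp_errors, eta)
    r_true = soft_min_score(true_errors, eta)
    return r_true, expected_r_hat, r_true + eta / (8 * m)
```

The enumeration costs `len(points) ** m` evaluations, so it is only meant for
very small `m`; its purpose is to make the two Jensen steps of the proof
tangible, not to estimate anything at scale.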



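Proof of Theorem 1. Given $x$, we partition the hypothesis set $\mathcal{H}$
into two. The subset $\mathcal{K}$ includes the hypotheses $h$ such that
$h(x) = +1$ and its complement $\mathcal{K}^c$ includes all $h$ for which
$h(x) = -1$. We can now write

$$\hat{\ell}(x) - \ell(x) = \frac{1}{\eta}\ln\left(\frac{\sum_{h\in\mathcal{K}} e^{-\eta\hat{\epsilon}(h)}}{\sum_{h\in\mathcal{K}} e^{-\eta\epsilon(h)}}\right) - \frac{1}{\eta}\ln\left(\frac{\sum_{h\in\mathcal{K}^c} e^{-\eta\hat{\epsilon}(h)}}{\sum_{h\in\mathcal{K}^c} e^{-\eta\epsilon(h)}}\right) = \bigl(\hat{R}_\eta(\mathcal{K}) - R_\eta(\mathcal{K})\bigr) - \bigl(\hat{R}_\eta(\mathcal{K}^c) - R_\eta(\mathcal{K}^c)\bigr). \tag{2}$$

Combining Lemmas 1 and 2, we find that

$$\Pr\bigl[R_\eta(\mathcal{K}) - \hat{R}_\eta(\mathcal{K}) \ge \lambda\bigr] \le e^{-2\lambda^2 m} \tag{3}$$

and

$$\Pr\left[\hat{R}_\eta(\mathcal{K}^c) - R_\eta(\mathcal{K}^c) \ge \lambda + \frac{\eta}{8m}\right] \le e^{-2\lambda^2 m}. \tag{4}$$

By (2), the event $\ell(x) - \hat{\ell}(x) \ge 2\lambda + \eta/(8m)$ implies
at least one of the events in (3) and (4). Combining (2)-(4) with the union
bound, we prove the claim for $s = -1$. The proof for $s = +1$ is almost
identical, with the roles of $\mathcal{K}$ and $\mathcal{K}^c$ exchanged. □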



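Lemma 3. For any distribution $D$, any $\lambda, \eta > 0$ and any
$s \in \{-1, +1\}$, the probability over samples $S \sim D^m$ that

$$\Pr_{(x,y)\sim D}\left[s\bigl(\hat{\ell}(x) - \ell(x)\bigr) \ge 2\lambda + \frac{\eta}{8m}\right] > \sqrt{2}\,e^{-\lambda^2 m}$$

is at most $\sqrt{2}\,e^{-\lambda^2 m}$.

Proof. Since Theorem 1 holds for all $x$, it also holds for a random $x$.
Thus,

$$\mathrm{E}_{S\sim D^m}\left[\Pr_{(x,y)\sim D}\left[s\bigl(\hat{\ell}(x) - \ell(x)\bigr) \ge 2\lambda + \frac{\eta}{8m}\right]\right] = \mathrm{E}_{(x,y)\sim D}\left[\Pr_{S\sim D^m}\left[s\bigl(\hat{\ell}(x) - \ell(x)\bigr) \ge 2\lambda + \frac{\eta}{8m}\right]\right] \le 2e^{-2\lambda^2 m}.$$

The lemma now follows using Markov's inequality. □

Theorem 2 follows immediately from Lemma 3.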
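The Markov step can be written out explicitly. Writing $Z(S)$ as shorthand
(introduced here, not in the original) for the inner probability over
$(x,y) \sim D$, which is a nonnegative random variable depending on the
sample $S$, the threshold $\sqrt{2}\,e^{-\lambda^2 m}$ is exactly the value
that balances the two sides:

```latex
\Pr_{S \sim D^m}\!\left[ Z(S) > \sqrt{2}\, e^{-\lambda^2 m} \right]
  \le \frac{\mathrm{E}_{S}[Z(S)]}{\sqrt{2}\, e^{-\lambda^2 m}}
  \le \frac{2 e^{-2\lambda^2 m}}{\sqrt{2}\, e^{-\lambda^2 m}}
  = \sqrt{2}\, e^{-\lambda^2 m}.
```

Setting $\sqrt{2}\,e^{-\lambda^2 m} = \delta$ gives
$\lambda = \sqrt{\ln(\sqrt{2}/\delta)/m}$, which is where the first term of
the threshold $\Delta$ in Theorem 2 comes from.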

4. Performance relative to the best hypothesis. We now show that there
exists a setting of $\eta$ and $\Delta$ that yields performance guarantees
relative to the best hypothesis in the class. We compare these guarantees to
those given by the Occam argument [2] for the algorithm that uses a
hypothesis that minimizes the empirical error rate.

In Lemma 3 we showed that the value of $\hat{\ell}(x)$ is, with high
probability, close to $\ell(x)$. We now show that, with respect to the actual
distribution $D$, the sign of $\ell(x)$ is closely related to that of the
best hypothesis in $\mathcal{H}$. By combining these theorems, we show that
the generalization error of our algorithm is close to that of the best
hypothesis in $\mathcal{H}$.

Note that the following theorem does not involve the training set in any
way; it is a claim about $y\ell(x)$, which is a deterministic function of
$(x, y)$. Intuitively, for large enough values of $\eta$, the function
$\ell(x)$ essentially averages the best hypotheses from $\mathcal{H}$. In the
worst case, as we show in Section 5, this can at most double the error. The
following theorem gives a detailed tradeoff between all the parameters.

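Theorem 4. Let $\mathcal{H}$ be a finite hypothesis class and let $\epsilon$
be the error of the best hypothesis in $\mathcal{H}$ with respect to the
distribution $D$ over the examples, that is,
$\epsilon = \min\{\epsilon(h) : h \in \mathcal{H}\}$. Let $\eta > 0$ and
$\Delta \ge 0$ be such that $\Delta\eta \le 1/2$. Then for any
$\gamma \ge \ln(8|\mathcal{H}|)/\eta$,

$$\Pr_{(x,y)\sim D}[y\ell(x) \le 0] \le 2\bigl(1 + 2|\mathcal{H}|e^{-\eta\gamma}\bigr)(\epsilon + \gamma),$$

and

$$\Pr_{(x,y)\sim D}[y\ell(x) \le 2\Delta] \le \bigl(1 + e^{2\Delta\eta}\bigr)\bigl(1 + 2|\mathcal{H}|e^{\eta(2\Delta-\gamma)}\bigr)(\epsilon + \gamma) \le 4\bigl(1 + 2|\mathcal{H}|e^{\eta(2\Delta-\gamma)}\bigr)(\epsilon + \gamma).$$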

Proof. We partition the hypotheses in $\mathcal{H}$ into two sets according
to their true error. We call those hypotheses whose error is at most
$\epsilon + \gamma$ strong and the other hypotheses weak.

We denote by $W_w$ the total weight of the weak hypotheses:

$$W_w \doteq \frac{1}{Z}\sum_{h\in\mathcal{H}:\,\epsilon(h) > \epsilon+\gamma} e^{-\eta\epsilon(h)},$$

where

$$Z \doteq \sum_{h\in\mathcal{H}} e^{-\eta\epsilon(h)}.$$

To upper bound $W_w$, note that we always have at least one strong
hypothesis, namely, the one that achieves $\epsilon(h) = \epsilon$. Thus,

$$W_w \le \frac{|\mathcal{H}|\,e^{-\eta(\epsilon+\gamma)}}{e^{-\eta\epsilon}} = |\mathcal{H}|\,e^{-\eta\gamma}. \tag{5}$$

From the assumption that $\gamma \ge \ln(8|\mathcal{H}|)/\eta$, we get that
$W_w \le 1/8$.

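For a given example $(x, y)$, we partition the strong hypotheses into two
subsets according to whether or not the hypothesis gives the correct
prediction on $(x, y)$. We denote the total weight of these subsets by

$$W_+(x,y) \doteq \frac{1}{Z}\sum_{h\in\mathcal{H}:\,\epsilon(h)\le\epsilon+\gamma,\ h(x)=y} e^{-\eta\epsilon(h)}, \qquad W_-(x,y) \doteq \frac{1}{Z}\sum_{h\in\mathcal{H}:\,\epsilon(h)\le\epsilon+\gamma,\ h(x)\neq y} e^{-\eta\epsilon(h)}.$$

By the definition of $Z$, for any $(x, y)$,

$$W_+(x,y) + W_-(x,y) + W_w = 1.$$

We now prove the second part of the theorem; the first part follows from
the second part by setting $\Delta = 0$. We first bound $y\ell(x)$ using
$W_w$, $W_+(x,y)$ and $W_-(x,y)$:

$$y\ell(x) \ge \frac{1}{\eta}\ln\left(\frac{W_+(x,y)}{W_-(x,y) + W_w}\right).$$

Thus, $y\ell(x) \le 2\Delta$ implies

$$\frac{W_-(x,y) + W_w}{1 - (W_-(x,y) + W_w)} \ge e^{-2\Delta\eta}$$

or, equivalently,

$$W_-(x,y) + W_w \ge \frac{1}{1 + e^{2\Delta\eta}} \doteq C.$$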

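We denote by $h \sim \mathcal{S}$ the random choice of a hypothesis from the
strong set with probability $e^{-\eta\epsilon(h)}/Z_s$, where $Z_s$
normalizes the weights within the strong set to sum to 1. We find that

$$\begin{aligned}
\Pr_{(x,y)\sim D}[y\ell(x) \le 2\Delta]
&\le \Pr_{(x,y)\sim D}\bigl[W_-(x,y) + W_w \ge C\bigr] \\
&= \Pr_{(x,y)\sim D}\left[\frac{W_-(x,y)}{W_-(x,y) + W_+(x,y)} \ge \frac{C - W_w}{1 - W_w}\right] \\
&= \Pr_{(x,y)\sim D}\left[\Pr_{h\sim\mathcal{S}}[h(x) \neq y] \ge \frac{C - W_w}{1 - W_w}\right] \\
&\le \mathrm{E}_{(x,y)\sim D}\Bigl[\Pr_{h\sim\mathcal{S}}[h(x) \neq y]\Bigr]\,\frac{1 - W_w}{C - W_w} && (6) \\
&= \mathrm{E}_{h\sim\mathcal{S}}\Bigl[\Pr_{(x,y)\sim D}[h(x) \neq y]\Bigr]\,\frac{1 - W_w}{C - W_w} && (7) \\
&\le (\epsilon + \gamma)\,\frac{1 - W_w}{C - W_w} && (8) \\
&\le (\epsilon + \gamma)\bigl(1 + e^{2\Delta\eta}\bigr)\bigl(1 + 2W_w e^{2\Delta\eta}\bigr). && (9)
\end{aligned}$$

Equations (6) and (7) use Markov's inequality and Fubini's theorem.
Equation (8) follows from the fact that $\epsilon(h) \le \epsilon + \gamma$
for every strong hypothesis. Equation (9) uses our assumptions that
$\Delta\eta \le 1/2$ and $W_w \le 1/8$ together with the inequality
$(1 - x)/(1 - x(1 + r)) \le 1 + 2xr$ for $x \ge 0$, $r \ge 0$ and
$x(1 + r) \le 1/2$ (with $x = W_w$ and $r = e^{2\Delta\eta}$).

Combining this bound with (5) proves the second statement of the theorem. □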
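The elementary inequality used for equation (9) is stated without proof; a
short verification, under the same conditions $x \ge 0$, $r \ge 0$ and
$x(1+r) \le 1/2$, runs as follows. Since $1 - x(1+r) \ge 1/2 > 0$, the claim
is equivalent to the first line below.

```latex
\begin{aligned}
1 - x &\le (1 + 2xr)\bigl(1 - x(1+r)\bigr) \\
      &= 1 - x - xr + 2xr - 2x^2 r(1+r) \\
      &= 1 - x + xr\bigl(1 - 2x(1+r)\bigr),
\end{aligned}
\qquad\text{which holds because } xr \ge 0 \text{ and } 2x(1+r) \le 1 .
```

The side condition is met in the proof: with $W_w \le 1/8$ and
$e^{2\Delta\eta} \le e$, we get $W_w(1 + e^{2\Delta\eta}) \le (1+e)/8 < 1/2$.
The factor $1 + e^{2\Delta\eta}$ in (9) is $1/C$, pulled out of
$(1 - W_w)/(C - W_w)$ before the inequality is applied.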

5. Discussion. We now discuss the implications of Theorems 2 and 4. We
start with a corollary of Theorem 4 for a specific setting of the parameters
$\eta$ and $\Delta$ as a function of the sample size $m$, the size of the
hypothesis class $\mathcal{H}$ and the reliability parameter $\delta$.

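Corollary 1. Let $1/2 > \theta > 0$, $\delta > 0$ and

$$\eta = \ln(8|\mathcal{H}|)\,m^{1/2-\theta}, \qquad \Delta = 2\sqrt{\frac{\ln(\sqrt{2}/\delta)}{m}} + \frac{\ln(8|\mathcal{H}|)}{8m^{1/2+\theta}}.$$

For $m \ge 8$,

$$\Pr_{(x,y)\sim D}[y\ell(x) \le 0] \le 2\left(1 + \frac{1}{4m}\right)\left(\epsilon + \frac{1}{m^{1/2-\theta}} + \frac{\ln m}{\ln(8|\mathcal{H}|)\,m^{1/2-\theta}}\right),$$

and for

$$m \ge \left(8\sqrt{\ln(\sqrt{2}/\delta)}\,\ln(8|\mathcal{H}|)\right)^{1/\theta}$$

we have

$$\Pr_{(x,y)\sim D}[y\ell(x) \le 2\Delta] \le 5\left(\epsilon + 2\Delta + \frac{1}{m^{1/2-\theta}}\right).$$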



Proof. To prove the corollary, we use Theorem 4 with two different
settings of $\gamma$. The first bound is a result of choosing
$\gamma = 1/m^{1/2-\theta} + \ln m/(m^{1/2-\theta}\ln(8|\mathcal{H}|))$, and
the second is a result of choosing $\gamma = 2\Delta + m^{\theta-1/2}$. □
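The arithmetic behind the two choices of $\gamma$ is easy to check
numerically. The sketch below computes the parameter settings of Corollary 1
and the two right-hand sides; the function names are illustrative and nothing
here goes beyond the formulas stated in the corollary and its proof. With the
first $\gamma$ we have $\eta\gamma = \ln(8|\mathcal{H}|) + \ln m$, so
$2|\mathcal{H}|e^{-\eta\gamma} = 1/(4m)$; with the second,
$\eta(2\Delta - \gamma) = -\ln(8|\mathcal{H}|)$, so the factor
$4(1 + 2|\mathcal{H}|e^{\eta(2\Delta-\gamma)})$ equals 5.

```python
import math


def corollary_parameters(m, n_hypotheses, theta, delta):
    """Return (eta, Delta) as set in Corollary 1."""
    log_term = math.log(8 * n_hypotheses)
    eta = log_term * m ** (0.5 - theta)
    threshold = (2.0 * math.sqrt(math.log(math.sqrt(2.0) / delta) / m)
                 + log_term / (8.0 * m ** (0.5 + theta)))
    return eta, threshold


def first_gamma(m, n_hypotheses, theta):
    """gamma used for the first bound: m^(theta-1/2) * (1 + ln m / ln(8|H|))."""
    return m ** (theta - 0.5) * (1.0 + math.log(m) / math.log(8 * n_hypotheses))


def error_bound(eps, m, n_hypotheses, theta):
    """Right-hand side of the bound on Pr[y * l(x) <= 0]."""
    return 2.0 * (1.0 + 1.0 / (4.0 * m)) * (eps + first_gamma(m, n_hypotheses, theta))


def abstain_margin_bound(eps, m, n_hypotheses, theta, delta):
    """Right-hand side of the bound on Pr[y * l(x) <= 2 * Delta]."""
    _, threshold = corollary_parameters(m, n_hypotheses, theta, delta)
    return 5.0 * (eps + 2.0 * threshold + m ** (theta - 0.5))
```

Evaluating these for a few values of `theta` shows the tradeoff discussed
below: a smaller `theta` improves the rate in `m` but raises the sample size
required by the second statement.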



We now discuss the significance of each statement in the corollary. Let us
fix the reliability parameter $\delta$.

The first statement of Corollary 1 shows that the sign of the true log ratio
is a reasonably good proxy for the best hypothesis in the class, denoted
$h^*$. Specifically, the error of $\mathrm{sign}(\ell(x))$ is at most

$$2\epsilon + O\left(\frac{\ln m}{m^{1/2-\theta}}\right).$$

Let us separate between abstaining and making a mistake. If the algorithm
outputs 0 we say that it "abstained," while if it outputs $-1$ or $+1$ and
this label does not agree with the actual label of the example, then we say
that it "made a mistake." Combining this with the statement of Theorem 2, we
find that the probability that our algorithm makes a mistake on a test
example is bounded by

$$2\epsilon + \delta + O\left(\frac{\ln m}{m^{1/2-\theta}}\right). \tag{10}$$

Note that this bound is independent of $|\mathcal{H}|$.

In comparison, the upper bound on the error of the hypothesis that minimizes
the empirical risk is

$$\epsilon + O\left(\frac{\ln|\mathcal{H}|}{\sqrt{m}}\right). \tag{11}$$

We see that the dependence on $m$ here is slightly better, but the bound
depends on the hypothesis class, which is what we expect from an algorithm
that cannot abstain.

For our algorithm, the dependence on $|\mathcal{H}|$ instead appears in the
bound on the probability of abstaining on a test example; this is given in
the second statement of the corollary. Combining that statement with
Lemma 3, we find that for

$$m = \Omega\left(\left(\sqrt{\ln(1/\delta)}\,\ln|\mathcal{H}|\right)^{1/\theta}\right)$$

our algorithm will predict zero with probability at most

$$5\epsilon + 2\delta + O\left(\frac{\sqrt{\ln(1/\delta)} + \ln|\mathcal{H}|}{m^{1/2-\theta}}\right). \tag{12}$$

This bound is similar to the Occam bound (11), but the choice of $\theta$
makes an important difference in the dependence on $m$.

In effect, we are replacing one type of guarantee with a different one.
In the traditional analysis that is based on uniform convergence theory,
the guarantee is of the form "the error of the classification rule is at most
$\epsilon + O(\ln(|\mathcal{H}|)/\sqrt{m})$." Our algorithm is one for which
there are two guarantees. First, we can say that "the error of the
classification rule, when this rule makes a nonzero prediction, is at most
$2\epsilon + O(1/\sqrt{m})$" (no dependence on the size of $\mathcal{H}$
here). Second, we can show that the probability that the classification rule
will generate a 0 ("I do not know" prediction) is upper bounded by
$5\epsilon + O(\ln(|\mathcal{H}|)/\sqrt{m})$. This second bound does depend
on the size of $\mathcal{H}$. Note that this quantity (the probability of
predicting 0) can be estimated from an unlabeled set of instances. Unlike the
event of a classification mistake, which depends both on the predicted label
and the actual label, the event of predicting 0 does not depend on the actual
label. In practice, unlabeled data is usually much more plentiful than
labeled data. Therefore, in practice, we can estimate the probability of
abstaining directly and do not need to use a priori bounds.

We now argue that the factor of 2 in front of the error of the best
hypothesis in the class which appears in the first part of the corollary is
necessary. Suppose that the input domain $X$ is partitioned into two parts
$A_1$ and $A_2$, such that $D(A_1) = 1 - 2\epsilon$ and
$D(A_2) = 2\epsilon$. Suppose that all the hypotheses in $\mathcal{H}$
predict correctly on instances in $A_1$. For each $x \in A_2$, the prediction
of each hypothesis is chosen independently at random to be correct with
probability $1/2 - \eta\Delta$ and incorrect with probability
$1/2 + \eta\Delta$. (Suppose further the number of elements in $A_2$ and the
number of hypotheses are sufficiently large so that on most of the points in
$A_2$ the actual fraction of correct predictions is sufficiently close to
$1/2 - \eta\Delta$.) In this case each of the hypotheses in $\mathcal{H}$ has
error close to $2\epsilon(1/2 + \eta\Delta) \approx \epsilon(1 + O(m^{-\theta}))$.
This also implies that all of the hypotheses have approximately the same
weight.

Consider now the value of $\ell(x)$ for $x \in A_2$. As the weights of all of
the hypotheses are similar, we get that

$$\forall x \in A_2, \qquad y\ell(x) \approx \frac{1}{\eta}\ln\left(\frac{1/2 - \eta\Delta}{1/2 + \eta\Delta}\right) \approx -4\Delta.$$

As $\hat{\ell}$ is likely to be very close to $\ell$, we conclude that for
$x \in A_2$ our algorithm will usually make a nonzero prediction that is
incorrect. In other words, our algorithm will have a prediction error of
about $2\epsilon$ while each of the hypotheses has error of about $\epsilon$.
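The approximation by $-4\Delta$ is a first-order expansion in the small
quantity $u = 2\eta\Delta$. Under the parameter settings of Corollary 1 we
have $\eta\Delta = O(m^{-\theta})$, so the correction term vanishes relative
to $\Delta$ as $m$ grows:

```latex
\frac{1}{\eta}\ln\!\left(\frac{1/2 - \eta\Delta}{1/2 + \eta\Delta}\right)
  = \frac{1}{\eta}\Bigl[\ln(1 - 2\eta\Delta) - \ln(1 + 2\eta\Delta)\Bigr]
  = \frac{1}{\eta}\Bigl[-4\eta\Delta + O\bigl((\eta\Delta)^3\bigr)\Bigr]
  = -4\Delta + O\bigl(\eta^2\Delta^3\bigr).
```

Because $|y\ell(x)| \approx 4\Delta$ exceeds both the abstention threshold
$\Delta$ and the deviation $|\hat{\ell}(x) - \ell(x)|$ allowed by Theorem 1,
the algorithm commits to the wrong label on most of $A_2$ rather than
abstaining.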

It may seem impossible that the bound in (10) is independent of the 
number of hypotheses. First, one should recall that a similar phenomenon 
exists in the large margin analysis for hyperplanes, where the generalization 
error depends only on the margin and not on the dimension of the class. 
One should not interpret our result as suggesting that overfitting can never 
happen, regardless of the complexity of the hypothesis space. In truth, if 
the hypothesis space is too complex, the algorithm will simply abstain more 
often. For example, suppose that the hypothesis space consists of all binary 
functions on a finite domain. For any set of training examples, there is a 
function that has zero training error (assuming no example appears twice 
with different labels). However, we expect any algorithm to be unable to
predict the label of a new test example. Indeed, in this case, our algorithm
will abstain on all unseen examples [since $\hat{\ell}(x)$ is exactly zero
outside the training set].

Using the size of the hypothesis class as the measure of its complexity is
clearly a very rough upper bound. For example, consider the case in which a
large fraction of the hypotheses in $\mathcal{H}$ are all equal, or almost
equal, to a single function $h^*$. It is not hard to see that in this case
our prediction algorithm, as stated, will have a strong bias towards
predicting like $h^*$. This bias can be removed by replacing the set of
almost identical hypotheses by the single hypothesis $h^*$. Doing this also
improves the guaranteed performance bounds because it reduces
$|\mathcal{H}|$. A systematic way for removing this type of bias is to
replace $\mathcal{H}$ with an $\epsilon$-net that covers it. In other words,
find a set of functions $\mathcal{H}_\epsilon$ such that for any
$h \in \mathcal{H}$ there exists $f \in \mathcal{H}_\epsilon$ such that
$\Pr_{(x,y)\sim D}[h(x) \neq f(x)] \le \epsilon$. Of course, choosing an
$\epsilon$-cover requires knowledge of the marginal distribution over $x$
defined by $D$ and is a nontrivial computational problem. Potential future
research regarding the use of $\epsilon$-covers in conjunction with our
prediction algorithm is discussed in Section 9.

Finally, Theorem 4 shows that the error of our predictor cannot be much
worse than twice the error of the best hypothesis. On the other hand, it
is possible in some favorable situations for our predictor to significantly
outperform the best hypothesis. For example, suppose that there is an
$h^* \in \mathcal{H}$ such that $\epsilon(h^*) = 1/8$, and that for each
$h \in \mathcal{H}' = \mathcal{H} - \{h^*\}$, we have $\epsilon(h) = 1/4$.
Suppose further that for each $x$, the fraction of $h \in \mathcal{H}'$ with
the right label is 3/4. Choosing the hypothesis with lowest observed error
would give, hopefully, the hypothesis $h^*$ that has an error rate of 1/8. In
our setting, for a labeled example $(x, y)$, if $h^*(x) = y$, then

$$y\ell(x) = \frac{1}{\eta}\ln\left(\frac{e^{-\eta/8} + (3/4)|\mathcal{H}'|e^{-\eta/4}}{(1/4)|\mathcal{H}'|e^{-\eta/4}}\right) = \frac{1}{\eta}\ln\left(3 + \frac{e^{-\eta/8}}{(1/4)|\mathcal{H}'|e^{-\eta/4}}\right).$$

Thus, for $\eta = 1$, we have
$y\ell(x) = \ln(3 + 4e^{1/8}/|\mathcal{H}'|)$. Similarly, if
$h^*(x) \neq y$, we have $y\ell(x) \ge \ln(3 - 12e^{1/8}/|\mathcal{H}'|)$.
Note that this implies that $\mathrm{sign}(\ell_1(x))$, the analog
$p_{1,0}(x)$ of our predictor computed from the true log ratio, correctly
classifies all the examples (for $|\mathcal{H}|$ large). Theorem 1, with
$\lambda$ set to a constant, then guarantees for $m = O(\lg 1/\delta)$ that
$\hat{p}_{1,0}(x)$ has an error rate of at most $\delta$.
The important point here is that by averaging a large number of suboptimal 
hypotheses we achieve a prediction accuracy that is better than that of the 
optimal single hypothesis h* . 
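
The arithmetic of this example can be checked with a short computation. The sketch below is only an illustration: the function name `true_margin` and the choice of 1000 suboptimal hypotheses are our assumptions. It evaluates the true log ratio, multiplied by the label, directly from the weights $e^{-\eta/8}$ and $e^{-\eta/4}$.

```python
import math

def true_margin(eta, n_others, best_is_right):
    """y * l_eta(x) in the example: h* has error 1/8, each of the n_others
    remaining hypotheses has error 1/4, and 3/4 of them label x correctly."""
    w_best = math.exp(-eta / 8)
    w_other = math.exp(-eta / 4)
    right = 0.75 * n_others * w_other   # total weight voting for the correct label
    wrong = 0.25 * n_others * w_other   # total weight voting for the wrong label
    if best_is_right:
        right += w_best
    else:
        wrong += w_best
    return math.log(right / wrong) / eta
```

For $\eta = 1$ the value is positive whether or not $h^*$ labels $x$ correctly, as long as the number of suboptimal hypotheses is large enough for $3 - 12e^{1/8}/|\mathcal{H}'|$ to exceed 1.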

An interesting question was raised by one of the reviewers: why are we 
comparing the performance of our algorithm to that of the optimal single 
prediction rule when, in fact, one would expect a rule that is a combination
of many prediction rules to perform much better than any single rule? Our
answer is that in this work we wanted to relate our bounds to those that 
are proven using uniform-convergence analysis of the type advocated by 
Vapnik [21], and those have as their "gold standard" the performance of 
the optimal hypothesis. A natural direction for future research would be to 
compare the performance of our algorithm to that of the rule: 

sign lim Irtix) ] , 

which is the analog of our prediction rule when the distribution is known 
(or equivalently, in the limit of an infinite number of training examples). 
However, it is not clear whether this is the correct gold standard to use. 



6. Uniform bounds. The bound given in Lemma 1 applies to the case
in which the parameter $\eta$ is fixed ahead of time so that $\hat{R}_\eta(\mathcal{K})$ converges to
$E[\hat{R}_\eta(\mathcal{K})]$ for only a single value of $\eta$. In the next lemma we show that on a
single sample this convergence is likely to take place for all values of $\eta \ge 1$
simultaneously. (We can prove a similar result for $\eta > 0$ using a slightly more
complicated proof. However, because $\eta$ is typically large in this paper, we
omit this proof.) The proof is primarily taken from Allwein, Schapire
and Singer [1].



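Lemma 4. Let $\mathcal{K}$ and $\hat{R}_\eta(\mathcal{K})$ be as above for a sample of size $m$. For
$\lambda > 0$,

$$\Pr\bigl[\exists \eta \ge 1 : |\hat{R}_\eta(\mathcal{K}) - E[\hat{R}_\eta(\mathcal{K})]| \ge \lambda\bigr] \;\le\; \cdots\ \frac{8 \ln|\mathcal{K}|}{\lambda}\ \cdots$$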



The proof is given in the Appendix. 

We can now state the following theorems similar to Theorems 1 and 2.
These theorems show that it is possible to design an algorithm that chooses $\eta$
after the sample has been chosen without paying a large penalty in accuracy.



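Theorem 5. Let $\mathcal{K}$ and $\hat{R}_\eta(\mathcal{K})$ be as above for a sample of size $m$. For
any distribution $D$, any $\lambda > 0$ and any $s \in \{-1,+1\}$,

$$\Pr\Bigl[\exists \eta \ge 1 : s\bigl(\hat{\ell}_\eta(x) - \ell_\eta(x)\bigr) \ge 2\lambda + \frac{\eta}{8m}\Bigr] \;\le\; \cdots\ \ln|\mathcal{K}|\ \cdots$$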



Theorem 6. For any $\delta > 0$, if we set
A 



2. llnf"""'"l"l 



m
then, with probability at least $1 - \delta$ over the random choice of the training
set, for all $\eta \ge 1$

7. Infinite hypothesis classes. The ideas and results of Sections 2-4 can
be directly extended to infinite, even uncountable, hypothesis spaces. To
make this extension, we need to add as a parameter of the algorithm a finite
measure $\mu$ over the hypothesis space $\mathcal{H}$. For convenience, we assume in fact
that $\mu$ is a probability measure so that

$$\mu(\mathcal{H}) = \int_{\mathcal{H}} d\mu = 1.$$

Naturally, we will require certain measurability assumptions so that
everything is measurable that needs to be so. For our purposes, it is sufficient
that the following sets are measurable:

$$\{h \in \mathcal{H} : h(x) = +1\}, \quad \text{for all } x \in X,$$

$$\{h \in \mathcal{H} : \epsilon(h) \le \epsilon\}, \quad \text{for all } \epsilon \in \mathbb{R}.$$

In other words, these sets are assumed to be elements of the $\sigma$-algebra over
which the measure $\mu$ is defined.

The results for finite $\mathcal{H}$ presented earlier in the paper are, of course, a
special case in which $\mu$ is the uniform discrete measure $\mu(\mathcal{K}) = |\mathcal{K}|/|\mathcal{H}|$ for
all $\mathcal{K} \subseteq \mathcal{H}$.

Formally, the measure $\mu$ is used much like a Bayesian prior. However,
unlike a prior, we do not assume that there is a target hypothesis in $\mathcal{H}$ that
has been chosen randomly according to $\mu$.

The algorithm in Section 2 can now be extended by simply redefining the
empirical log ratio to be

$$\hat{\ell}_\eta(x) = \frac{1}{\eta} \ln\left(\frac{\int_{\{h : h(x) = +1\}} w(h)\, d\mu}{\int_{\{h : h(x) = -1\}} w(h)\, d\mu}\right),$$

where as usual $w(h) = e^{-\eta\hat{\epsilon}(h)}$ and the integral is the Lebesgue integral with
regard to the probability measure $\mu$. The true log ratio $\ell_\eta(x)$ is redefined
analogously.
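
When $\mu$ is supported on finitely many hypotheses, the two integrals reduce to $\mu$-weighted sums. The following minimal sketch illustrates the redefined log ratio in that special case. The function names, the argument layout and the convention of abstaining (returning 0) when the absolute log ratio is at most a threshold `delta` are assumptions of this sketch and should be checked against the definitions in Section 2.

```python
import numpy as np

def log_ratio(preds_x, emp_err, mu, eta):
    """Empirical log ratio at a single point x for a discrete measure mu.

    preds_x: predictions h(x) in {-1, +1}, one entry per hypothesis
    emp_err: empirical errors of the same hypotheses
    mu:      nonnegative weights summing to 1 (the measure of each hypothesis)
    """
    preds_x, emp_err, mu = map(np.asarray, (preds_x, emp_err, mu))
    w = mu * np.exp(-eta * emp_err)     # w(h) weighted by the measure of h
    pos = w[preds_x == +1].sum()        # integral over {h : h(x) = +1}
    neg = w[preds_x == -1].sum()        # integral over {h : h(x) = -1}
    return (np.log(pos) - np.log(neg)) / eta

def predict(preds_x, emp_err, mu, eta, delta):
    """Sign of the log ratio, or 0 (abstain) when its absolute value is <= delta."""
    l = log_ratio(preds_x, emp_err, mu, eta)
    return int(np.sign(l)) if abs(l) > delta else 0
```

With uniform `mu` this coincides with the finite-class definition, since the common factor $1/|\mathcal{H}|$ cancels in the ratio.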

To prove Theorems 1 and 2 and Lemmas 1-3 in this more general setting,
we simply need to replace each sum of the form $\sum_{h \in \mathcal{K}} f(h)$ with the integral
$\int_{\mathcal{K}} f(h)\, d\mu$ for measurable sets $\mathcal{K}$. [If $\mathcal{K}$ has measure zero, then $R_\eta(\mathcal{K})$ and
$\hat{R}_\eta(\mathcal{K})$ are both defined to be zero.]

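The only potential difficulty occurs in proving in Lemma 2 that $R_\eta(\mathcal{K}) \le
E[\hat{R}_\eta(\mathcal{K})]$. When $\mathcal{K}$ is finite, we can simply apply Jensen's inequality to a
function of $|\mathcal{K}|$ real variables. When $\mathcal{K}$ is infinite, however, this may be a
problem since standard forms of Jensen's inequality do not apply. Nevertheless,
we can effectively reduce to the finite case as follows:
Let $\delta > 0$. Let

$$B_i = \{h \in \mathcal{K} : i\delta \le \epsilon(h) < (i+1)\delta\}.$$

Since $\epsilon(h) \in [0,1]$, $B_0, \ldots, B_k$ form a partition of $\mathcal{K}$ for $k = \lfloor 1/\delta \rfloor$. For $\mu(B_i) > 0$,
define $\hat{\epsilon}_i$ to be a random variable that is the average of $\hat{\epsilon}(h)$ over $h \in B_i$,
that is,

$$\hat{\epsilon}_i = \frac{1}{\mu(B_i)} \int_{B_i} \hat{\epsilon}(h)\, d\mu.$$

Then

$$E[\hat{\epsilon}_i] = \frac{1}{\mu(B_i)} \int_{B_i} \epsilon(h)\, d\mu < (i+1)\delta.$$

Combined with the fact that $\epsilon(h) \ge i\delta$ for $h \in B_i$, this gives

$$\int_{\mathcal{K}} e^{-\eta\epsilon(h)}\, d\mu \le \sum_i \mu(B_i)\, e^{-\eta(E[\hat{\epsilon}_i] - \delta)},$$

where it is understood that all sums are over $i$ for which $\mu(B_i) > 0$. Thus,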

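\begin{align*}
R_\eta(\mathcal{K}) &= \frac{1}{\eta} \ln \int_{\mathcal{K}} e^{-\eta\epsilon(h)}\, d\mu \\
&\le \delta + \frac{1}{\eta} \ln \sum_i \mu(B_i)\, e^{-\eta E[\hat{\epsilon}_i]} \\
&\le \delta + \frac{1}{\eta} E\Bigl[\ln \sum_i \mu(B_i)\, e^{-\eta\hat{\epsilon}_i}\Bigr] \tag{12} \\
&= \delta + \frac{1}{\eta} E\Bigl[\ln \sum_i \mu(B_i) \exp\Bigl(-\frac{\eta}{\mu(B_i)} \int_{B_i} \hat{\epsilon}(h)\, d\mu\Bigr)\Bigr] \\
&\le \delta + \frac{1}{\eta} E\Bigl[\ln \sum_i \mu(B_i)\, \frac{1}{\mu(B_i)} \int_{B_i} e^{-\eta\hat{\epsilon}(h)}\, d\mu\Bigr] \tag{13} \\
&= \delta + \frac{1}{\eta} E\Bigl[\ln \int_{\mathcal{K}} e^{-\eta\hat{\epsilon}(h)}\, d\mu\Bigr] \\
&= \delta + E[\hat{R}_\eta(\mathcal{K})].
\end{align*}

Equation (12) uses Jensen's inequality applied to the convex function

$$(x_0, \ldots, x_k) \mapsto \ln \sum_i \mu(B_i)\, e^{-\eta x_i}.$$

(Convexity follows from a minor modification of the proof given in Lemma 2
for the function $g$.) Equation (13) applies Jensen's inequality to the convex
function $e^x$. Since $\delta$ is arbitrary, the result follows.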

The results in Section 4 compare performance to that of the best single
hypothesis. When $\mathcal{H}$ is infinite, this comparison may be meaningless since
this single hypothesis is likely to have measure zero. Moreover, the bounds
in Section 4 are in terms of $|\mathcal{H}|$, which will now be infinite.

Therefore, rather than comparing to a single best hypothesis, we compare
to a set of good hypotheses. In particular, for any $\epsilon > 0$, let $V_\epsilon$ be the volume
of all hypotheses with error at most $\epsilon$:

$$V_\epsilon = \mu(\{h : \epsilon(h) \le \epsilon\}).$$

Then throughout this section we need to replace $|\mathcal{H}|$ with $1/V_\epsilon$.

Specifically, the generalization of Theorem 4 becomes the following: 

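Theorem 7. Let $\mathcal{H}$ be any hypothesis class. Let $\epsilon > 0$ and let $V_\epsilon =
\mu(\{h : \epsilon(h) \le \epsilon\})$. Assume $\epsilon$ is large enough that $V_\epsilon > 0$. Let $\eta > 0$ and $\lambda > 0$
be such that $\lambda\eta \le 1/2$. Then for any $\gamma \ge \ln(8/V_\epsilon)/\eta$,

$$\Pr_{(x,y)\sim D}[y\,\ell_\eta(x) \le 0] \le 2\bigl(1 + (2/V_\epsilon)e^{-\eta\gamma}\bigr)(\epsilon + \gamma),$$

and

\begin{align*}
\Pr_{(x,y)\sim D}[y\,\ell_\eta(x) \le 2\lambda] &\le (1 + e^{2\lambda\eta})\bigl(1 + (2/V_\epsilon)e^{\eta(2\lambda-\gamma)}\bigr)(\epsilon + \gamma) \\
&\le 4\bigl(1 + (2/V_\epsilon)e^{\eta(2\lambda-\gamma)}\bigr)(\epsilon + \gamma).
\end{align*}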

The modification of Corollary 1 is immediate. In the discussion following
Corollary 1, $\epsilon(h^*)$ is replaced by $\epsilon$ as in Theorem 7.

Besides replacing $|\mathcal{H}|$ by $1/V_\epsilon$, the proof of Theorem 4 only needs to be
modified by replacing all sums with integrals. Also, to upper bound Ww, we
lower bound $Z$ by $V_\epsilon e^{-\eta\epsilon}$, a fact that follows immediately from the definition
of $V_\epsilon$.

Generalizing the results of Section 6 to infinite classes $\mathcal{H}$ seems harder and
remains an open problem for future research.

8. Conclusions. In this paper we present a new algorithm for prediction
of binary functions using a weighted vote over all prediction rules within a 
class. We have shown when, and in what sense, this algorithm can perform 
better than the more common approach of choosing the prediction function 
which performs best on the training data. 

While this algorithm is similar in spirit to a Bayesian prediction algorithm, 
there are at least two important differences. 

The first difference is in the dependence of the posterior probability (before
normalization) on the size of the training set $m$. In most Bayesian
algorithms the expected value of the unnormalized posterior probability for
any particular model decreases at the rate $\exp(-c(\theta)m)$, where $c(\theta)$ is
the expected value of the log probability of the data given the model. In
our algorithm the rate of decrease is (approximately) $\exp(-c(\theta)\sqrt{m})$. We
choose this rate (Corollary 1) so that the variance of the empirical log ratio
is slowly decreasing, which results in an estimator whose stability improves
as the size of the sample increases.

Second, the goal of our algorithm is to increase the stability of the pre- 
diction and not to optimize a Bayesian measure of risk. To that end, the 
only assumption regarding the data generation mechanism that we make in 
our analysis is that the data is generated in an IID fashion. To the best of 
our knowledge, all existing Bayesian analyses (other than on-line prediction
methods) make the assumption that the data is generated by one of the
models in the class over which the Bayesian averaging is performed. In this 
context it is worthwhile to mention recent work by Bousquet and Elisseeff [3] 
in which they show how improved generalization bounds can be proven for 
algorithms that are known to be stable. The main difference between that 
work and our work here is that we describe and analyze a specific averaging 
method that is guaranteed to be stable. 

It was suggested that the main reason that our algorithm does not over- 
fit has to do with the fact that we allow abstention, rather than with the 
averaging of many hypotheses. We believe that the most important property 
of our algorithm is the stability of the empirical log-ratio. Abstention is just 
one way of utilizing this stability. In other scenarios one may be better off 
using the log-ratio scores differently. For example, if the goal is to detect a 
rare type of instance within a large set, the correct method might be to sort 
all instances according to their log-ratio score and output the instances with 
the highest scores. 

It is natural to think of the empirical log ratio as an estimate of the 
conditional probability of the label y given the instance x. However, one 
should not take this intuition too far. The log ratio is a measure of the
model uncertainty, by which we mean the uncertainty in the identity of the
best model which results from the finite size of the training set. It does not
measure the uncertainty that is inherent in the true conditional distribution
of $y$ given $x$. To see this, consider a class with 100 rules in which one rule
has a true error of 10%, while the true error of each of the other 99 rules is 
larger than 20%. Then with a training set with a few hundred examples the 
weight assigned to the best rule is likely to be larger than the total weight 
of all of the other rules. This in turn would imply that the log ratio would 
be very far from zero everywhere and our algorithm will always predict like 
the best rule and never abstain. Indeed, we can interpret the log-ratio values 
as an indication that we are certain which is the best rule in the class. This 
is quite independent of the fact that the best rule in the class has an error 
of 10%. To estimate this conditional probability we need to calibrate the 
predictions of our algorithm. One can devise various ways of performing 
this calibration. An interesting parameter-free calibration method was
recently suggested by Vovk [22].

Our work shares some ideas with the recent work by Shawe-Taylor and
Williamson [20] and McAllester [16] on PAC-Bayesian analysis. The main 
common idea is that if many classification rules perform well, then their 
prediction can be trusted more than that of a single rule that is performing 
well. The main difference is that in our work we average over the predic- 
tions of the best rules and get a different prediction confidence for each test 
instance, while the PAC-Bayesian analysis uses the plurality of the good 
performers to improve the performance guarantees for a single classification 
rule that is chosen at random according to the posterior. Similar ideas were 
used in the analysis of large-margin classifiers. 

Another connection worth mentioning here is to margin-based classification
methods such as SVMs [19, 21] and boosting [10, 18]. One intuition
that explains why large margins are important regards the stability of the 
linear classifier. Large margins around the separating hyperplane imply that 
slight perturbations of the hyperplane will also classify the data correctly. 
In other words, it implies that a large set of similar linear classifiers have 
small training error. Suppose now that we used the averaging algorithm sug- 
gested in this paper where the set of classifiers that is used is the set of all 
linear classifiers. The fact that the set of close-to-optimal classifiers is large 
implies that the prediction where they all agree would be very confident. 
On the other hand, the region on which the algorithm will abstain is similar 
(but not identical) to the margin region. In other words, the behavior of 
our algorithm is, in fact, similar to that of large margin classifiers. However, 
there are two important differences. On the one hand, the averaging algo- 
rithm is much more general in that it can be applied to any set of classifiers, 
not just linear classifiers; neither does it depend on whether or not the data 
is separable, that is, perfectly classifiable by one of the rules in the class. On 
the other hand, our algorithm is extremely inefficient as compared to SVMs 
or AdaBoost as its application requires calculating the empirical error for 
each and every rule in the set.

9. Future research. We suggest two directions for future work, one
regarding computational efficiency, the other regarding the choice of a prior
distribution. 

Consider first the computational issue. For most interesting hypothesis 
classes the task of finding the hypothesis that minimizes the training error 
is computationally intractable. Obviously, calculating the error of all of the 
hypotheses in the class is at least as hard as finding the best hypothesis and 
probably much harder. Does this mean that our algorithm cannot be used 
for practical learning problems? Not necessarily. Here are four approaches
to solving the computational problem:

1. Sometimes the problem of learning a complex classification rule can be 
broken down into several problems of learning very simple rules. For ex- 
ample, Freund and Mason [9] show how to break down the problem of 
learning alternating decision trees (a class of rules which generalizes deci- 
sion trees and boosted decision trees) into a sequence of simpler learning 
problems using boosting. Each of the simpler problems involves finding 
the best threshold rule in one dimension. These last problems are so simple
that the calculation can be done in time linear in the size of the training
set. In this context our algorithm can be used directly and its use might
significantly increase the robustness of the system as a whole. 

2. In some cases a careful choice of the prior distribution over the hypothe- 
ses makes it possible to calculate the posterior average efficiently. For 
example, conjugate priors commonly used in Bayesian statistics are prior
distributions that maintain their functional form as they are updated.
A more interesting case which involves variable-length Markov models 
for sequences was studied by Willems, Shtarkov and Tjalkens [23] and 
extended by Helmbold and Schapire [12]. It might be possible to adapt 
these techniques to efficiently calculate the empirical log ratio for our 
algorithm. 

3. In some cases the posterior distribution can be approximated by a single
sharp peak around the best hypothesis. In such cases the empirical log
ratio can be approximated using the Laplace approximation method. This
technique was used by Freund [8]. For an introduction to this type of
approximation method see the excellent book by de Bruijn [7].

4. Another approach to estimating the average vote over the empirically best
hypotheses is to use random sampling. Suppose we are given a learning
algorithm capable of finding a hypothesis with small training error. Our 
goal is to tweak the algorithm in a way that will randomly create a 
hypothesis whose performance is almost as good as the original untweaked 
hypothesis. Moreover, we want the distribution according to which the 
hypothesis is generated to be close to the distribution defined by our 
exponential weights.

There are several learning algorithms that sample hypotheses and average
them. The best known of these so-called ensemble algorithms is
Breiman's bagging algorithm [4, 5]. It might be that bagging is indeed an 
efficient randomized algorithm of the type suggested here. On the other 
hand, it might be possible to adapt the theory presented in this paper to 
give a rigorous analysis for the performance of bagging and other ensem- 
ble methods. 
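
The sampling approach in item 4 rests on the observation that the ratio of the two weight sums equals the ratio of the probabilities of a +1 and a −1 vote when a hypothesis is drawn in proportion to its exponential weight. The sketch below illustrates only this estimator. It draws samples exactly from a small finite class, which of course gives no computational saving; a practical method would replace the call to `rng.choice` with an approximate sampler. The function name and the add-one smoothing are assumptions of this sketch.

```python
import numpy as np

def sampled_log_ratio(preds_x, emp_err, eta, n_samples, rng):
    """Monte Carlo estimate of the empirical log ratio at a point x.

    Hypotheses are drawn with probability proportional to exp(-eta * error);
    the log ratio is then estimated from the sampled vote counts.
    """
    preds_x = np.asarray(preds_x)
    w = np.exp(-eta * np.asarray(emp_err))
    q = w / w.sum()                           # exponential-weights distribution
    idx = rng.choice(len(q), size=n_samples, p=q)
    votes = preds_x[idx]
    n_pos = np.count_nonzero(votes == +1)
    n_neg = n_samples - n_pos
    # Add-one smoothing keeps the estimate finite if one side gets no votes.
    return np.log((n_pos + 1) / (n_neg + 1)) / eta
```

The estimate is least reliable exactly where it matters least for abstention: when one side receives almost no votes, the true log ratio is far from zero and its sign is still recovered.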

The second direction we suggest for future work is to consider the choice
of the prior measure $\mu$ defined in Section 7. Clearly, the choice of measure
has a large influence on the algorithm and on the upper bound given in
Theorem 7.

Intuitively, we would like to maximize the probability measure of the set
$V_\epsilon$. However, we need to define the measure $\mu$ before observing the training
data, that is, before we know what $V_\epsilon$ is. One natural approach is to
maximize the minimum, over all possible sets $V_\epsilon$, of their measure.

Consider first a case in which we have prior knowledge of the distribution
over the instances, without the labels. In this case we can use the measure
which places uniform weights over an $\epsilon$-net on the hypothesis class, as was
suggested in Section 5. This will ensure that if the best hypothesis in the
class has error $\epsilon^*$, then the set $V_{\epsilon^*+\epsilon}$ will have measure at least $1/N$, where
$N$ is the size of the $\epsilon$-net. The disturbing thing about this choice for $\mu$ is
that it depends on $\epsilon$. Possibly this difficulty can be removed if one can
use a limit distribution where $\epsilon \to 0$. Intuitively, such a limit measure will
capture the detailed structure of the hypothesis space in a way similar to
Jeffreys' prior in Bayesian analysis.

Assuming that this analysis can be carried through, one should return to 
the original problem in which the distribution over the instances is unknown. 
In this case we need to approximate the "ideal" algorithm by using the 
information about the instance distribution that we get from the training 
examples. Ultimately, we would like to find an averaging algorithm whose 
performance is close to the averaging algorithm that has this prior knowledge 
and that is efficiently computable. 



APPENDIX: PROOF OF LEMMA 4 



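First, let $\mathcal{K} = \{h_1, \ldots, h_N\}$, and let

$$F(\eta, x) = \frac{1}{\eta} \ln \sum_{i=1}^{N} e^{-\eta x_i}.$$

For any $x$, by checking derivatives, it can be verified that the function
$\eta \mapsto F(\eta,x)$ is nonincreasing, while the function $\eta \mapsto F(\eta,x) - (\ln N)/\eta$ is
nondecreasing. Hence, for $0 < \eta_1 \le \eta_2$,

$$0 \le F(\eta_1,x) - F(\eta_2,x) \le \Bigl(\frac{1}{\eta_1} - \frac{1}{\eta_2}\Bigr) \ln N. \tag{14}$$

Now let

$$\mathcal{E} = \Bigl\{\frac{4\ln N}{i\lambda} : i = 1, \ldots, \Bigl\lceil\frac{4\ln N}{\lambda}\Bigr\rceil\Bigr\}.$$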

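We show next that for any $\eta \ge 1$, there exists $\hat{\eta} \in \mathcal{E}$ such that

$$\Bigl|\frac{1}{\eta} - \frac{1}{\hat{\eta}}\Bigr| \ln N \le \frac{\lambda}{4}.$$

For if $\eta > 4(\ln N)/\lambda$, then let $\hat{\eta} = 4(\ln N)/\lambda$. Then

$$0 \le \Bigl(\frac{1}{\hat{\eta}} - \frac{1}{\eta}\Bigr) \ln N \le \frac{1}{\hat{\eta}} \ln N = \frac{\lambda}{4}.$$

Otherwise, if $1 \le \eta \le 4(\ln N)/\lambda$, then let $\hat{\eta} = 4(\ln N)/(i\lambda)$ be the smallest
element of $\mathcal{E}$ that is no smaller than $\eta$. That is,

$$\frac{4\ln N}{(i+1)\lambda} < \eta \le \frac{4\ln N}{i\lambda}.$$

Then

$$0 \le \Bigl(\frac{1}{\eta} - \frac{1}{\hat{\eta}}\Bigr) \ln N = \Bigl(\frac{1}{\eta} - \frac{i\lambda}{4\ln N}\Bigr) \ln N < \Bigl(\frac{(i+1)\lambda}{4\ln N} - \frac{i\lambda}{4\ln N}\Bigr) \ln N = \frac{\lambda}{4}.$$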
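
The covering step can be checked numerically. The sketch below assumes the grid $\{4\ln N/(i\lambda) : i = 1, \ldots, \lceil 4\ln N/\lambda\rceil\}$ as we read it from the proof, and verifies for a given $\eta \ge 1$ that some grid point $\hat{\eta}$ satisfies $|1/\eta - 1/\hat{\eta}| \ln N \le \lambda/4$. The function names are illustrative.

```python
import math

def eta_grid(n, lam):
    """Grid {4 ln n / (i * lam) : i = 1, ..., ceil(4 ln n / lam)}."""
    top = 4 * math.log(n) / lam
    return [top / i for i in range(1, math.ceil(top) + 1)]

def nearest_grid_gap(eta, n, lam):
    """Smallest value of |1/eta - 1/eta_hat| * ln n over grid points eta_hat."""
    return min(abs(1 / eta - 1 / g) for g in eta_grid(n, lam)) * math.log(n)
```

The reciprocals of the grid points are equally spaced with gap $\lambda/(4\ln N)$ and reach at least 1, which is why a grid of this size suffices for every $\eta \ge 1$.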

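Since $\hat{R}_\eta(\mathcal{K}) = F(\eta, \langle\hat{\epsilon}(h_1), \ldots, \hat{\epsilon}(h_N)\rangle)$, (14) and the argument above imply
that for any $\eta \ge 1$, there exists $\hat{\eta} \in \mathcal{E}$ such that

$$|\hat{R}_\eta(\mathcal{K}) - \hat{R}_{\hat{\eta}}(\mathcal{K})| \le \frac{\lambda}{4}$$

and so

$$\bigl|(\hat{R}_\eta(\mathcal{K}) - E[\hat{R}_\eta(\mathcal{K})]) - (\hat{R}_{\hat{\eta}}(\mathcal{K}) - E[\hat{R}_{\hat{\eta}}(\mathcal{K})])\bigr| \le \frac{\lambda}{2}.$$

Thus,

\begin{align*}
\Pr\bigl[\exists \eta \ge 1 : |\hat{R}_\eta(\mathcal{K}) - E[\hat{R}_\eta(\mathcal{K})]| \ge \lambda\bigr]
&\le \Pr\Bigl[\exists \hat{\eta} \in \mathcal{E} : |\hat{R}_{\hat{\eta}}(\mathcal{K}) - E[\hat{R}_{\hat{\eta}}(\mathcal{K})]| \ge \frac{\lambda}{2}\Bigr] \\
&\le \cdots,
\end{align*}

where the second inequality uses the union bound combined with Lemma 1.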